In recent decades, Philadelphia has experienced substantial changes in population dynamics, economic fluctuations, and redevelopment pressures. Among these changes, housing values have become a key indicator of the socio-economic conditions within the city’s neighborhoods. Rising housing values typically reflect a higher presence of wealthier residents or a neighborhood undergoing gentrification, while declining housing values often signal increased economic pressure or decline. Therefore, accurately forecasting the median housing values across Philadelphia is crucial for urban planners and policymakers, helping them to proactively address potential risks such as displacement and disinvestment. Supporting Philadelphia’s long-term vision of fostering vibrant, diverse, and resilient communities is also essential.
To better understand the potential factors influencing housing values in Philadelphia, we identify several key variables: educational attainment, vacancy rates, the proportion of single-family homes, and poverty levels. These factors are closely related to housing market trends and provide insights into the neighborhood’s socio-economic status.
Educational attainment is closely linked to the socio-economic characteristics of neighborhoods. Individuals with higher educational attainment typically earn higher incomes and contribute more to local economies. A higher concentration of individuals with advanced educational backgrounds tends to increase the demand for well-designed housing in affluent neighborhoods. With a consistent supply, housing prices and values are likely to rise. Similarly, a high poverty rate also indicates a declining community. Since many residents cannot afford luxury housing, the housing values in those neighborhoods are likely to be Lower.
High vacancy rates often correlate with declining neighborhoods and reduced median house values. Research shows that vacant properties affect multiple facets of community life, such as housing and neighborhood vitality, crime prevention efforts, and the well-being of commercial districts. As a result, areas with numerous vacant housing units typically see diminished median housing values.
The proportion of single-family homes in an area influences housing values. These homes are generally private and comfortable, making them more desirable in many U.S. housing markets. However, they are relatively common in suburban neighborhoods and may suffer from low accessibility and limited infrastructure, which can negatively affect their property values as well.
In this study, we utilize ordinary least squares (OLS) regression to analyze the relationship between these socioeconomic factors and median house values in Philadelphia. By examining these relationships, we aim to identify critical predictors of median housing values throughout Philadelphia and offer insights for decision-makers and community initiatives.
To predict median house values in Philadelphia, we obtained the original dataset from the United States Census data. The dataset represents census block groups from the year 2000 and initially contained 1,816 observations. The key variables included:
To refine the dataset for modeling purposes, we applied the following filtering criteria:
Additionally, we removed a specific block group in North Philadelphia that exhibited inconsistencies, with an unusually high median house value (over $800,000) despite a very low median household income (less than $8,000).
After data cleaning, the final dataset contained 1,720 observations.
We will first examine the summary statistics ( mean and standard deviation (SD)) of key variables in the dataset, including the dependent variables MEDHVAL (Median House Value), and predictors NBELPOV100 (Households Living in Poverty), PCTBACHMOR (% of Individuals with Bachelor’s Degrees or Higher), PCTVACANT(% of Vacant Houses), PCTSINGLES( % of Single House Units).
The mean (\(\bar{X}\)) represents the average value of a variable and is calculated as:
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
where:
The mean gives us a single representative value of the dataset. To measure variability, we use the standard deviation (SD), which quantifies how much the values in a dataset deviate from the mean. The formula for the sample standard deviation (\(s\)) is:
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \]
where:
A larger standard deviation indicates that the data points are more spread out, while a smaller standard deviation suggests that the data points are closer to the mean.
We will also examine the histograms and apply log transformations for key variables to assess whether the transformed variables follow a more normal-like distribution.
Histograms provide a visual representation of how a variable’s values are distributed, helping to identify whether the data follows a normal distribution, is right-skewed, or left-skewed. Linear regression model assume that variables are approximately normally distributed.
X-axis: The values of the variable (e.g., house prices, income levels).
Y-axis: The frequency of observations within each bin.
If the histogram is right-skewed, it suggests that a small number of observations have significantly higher values compared to the rest. For variables following a right-skewed distribution, we will apply a log transformation to these variables to improve normality. Since the log transformation is undefined for zero or negative values, we must first check whether any variable contains zero.
By comparing the original histograms with the log-transformed histograms, we will assess whether the transformation improves the suitability of the data for predictive modeling.
We will analyze correlations between predictors, to detect potential multicollinearity before proceeding with regression analysis, which can distort model interpretations.
Multicollinearity occurs when predictors are highly correlated, which can lead to unstable regression coefficients. It also inflates standard errors, reducing the statistical significance of predictors, and increases the risk of overfitting, as redundant variables do not contribute new information to the model.
The correlation coefficient \(r\) is calculated as:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \] In a more concise way, this above formula is also equivalent to the following:
\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x}
\right) \left( \frac{y_i - \bar{y}}{S_y} \right)
\] where:
- \(X_i\) and \(Y_i\) are individual data points for
variables \(X\) and \(Y\), respectively.
- \(\bar{X}\) and \(\bar{Y}\) are the mean
values of \(X\) and \(Y\).
- The numerator represents the covariance between \(X\) and \(Y\), while the denominator
normalizes the values.
The correlation coefficient \(r\) ranges from -1 to 1: A value of +1 indicates a perfect positive linear relationship, while a value of -1 indicates a perfect negative linear relationship. A value of 0 suggests no linear relationship between the variables, meaning changes in one do not influence the other.
After getting a general sense of the data, we conduct multiple regression analysis to examine the relationship between the dependent variable, Median House Value (MEDHVAL), and the predictors, education attainment (PCTBACHMOR), number of Households Living in Poverty (NBELPOV100), percentage of Vacant Houses (PCTVACANT), and percentage of Single House Units (PCTSINGLES). Regression analysis is a statistical method to examine the relationship between a dependent variable and one or more predictors. With this type of analysis, researchers can identify the strength and direction of the relationship between variables, make predictions, and assess the significance of predictors. The model also estimates coefficients for each predictor, which represent the expected change in the dependent variable for a one-unit change in the predictor, holding other predictors constant. For this study, the multiple regression model is formulated as follows: \[ \text{LNMEDHVAL} = \beta_0 + \beta_1 \text{PCTVACANT} + \beta_2 \text{PCTSINGLES} + \beta_3 \text{PCTBACHMOR} + \beta_4 \text{log(NBELPOV100)} + \epsilon \] where LNMEDHVAL is the log-transformed median house value, PCTVACANT is the proportion of vacant housing units , PCTSINGLES is the proportion of the single family housing , PCTBACHMOR is the percentage of the residents holding bachelor’s degree or higher, and log(NBELPOV100) is the log-transformed number of households living below the poverty line.
\(\beta_0\) is the intercept, \(\beta_1\), \(\beta_2\), \(\beta_3\), and \(\beta_4\) are the coefficients for each predictor, and \(\epsilon\) is the error term. The coefficient \(\beta_1\), \(\beta_2\), \(\beta_3\), \(\beta_4\) represent the change in the log-transformed median house value for a one-unit change in the corresponding predictor, holding other predictors constant. The error term \(\epsilon\) accounts for the variability in the dependent variable that is not explained by the predictors.
There are several assumptions associated with regression analysis that need to be met for the results to be valid. These assumptions including linearity, independence of observations, homoscedasticity, normality of residuals, no multicollinearity, and no fewer than 10 observations per predictors.
First, Linearity assumes that the relationship between the dependent variable and the predictors is linear. To verify this assumption, we made scatter plots of the dependent variable against each predictor. If the relationship appears to be linear, the assumptions was met.
Second, Independence of Observations assumes that the observations are independent of each other. There should be no spatial or temporal or other forms of dependence in the data.
Third, Homoscedasticity assumes that the variance of the residuals \(\epsilon\) is constant regardless of the values of each level of the predictors. To check this assumption, we made a scatter plot of the standardized residuals against the predicted values. If the residuals are evenly spread around zero, the assumption was met. Any patterns may indicate the presence of heteroscedasticity.
Fourth, Normality of Residuals assumes that the residuals are normally distributed. We examined the histogram of the standardized residuals to check if they are approximately normally distributed. If the histogram is bell-shaped, the assumption was met.
Fifth, No Multicollinearity assumes that the predictors are not highly correlated with each other. We calculated the correlation matrix of the predictors to check for multicollinearity. If the correlation coefficients are is not greater than 0.8 or less than -0.8, the assumption was met.
Finally, No Fewer than 10 Observations per Predictor assumes that there are at least 10 observations for each predictor in the model. Since there are over 1,700 observations in the dataset, this assumption was met.
After verifing the assumptions of the regression, we start the regression analysis. Several parameters need to estimate here: - \(\beta_0\), which is the intercept - \(\beta_1, \dots, \beta_k\), which are coefficients of each independent variables - \(\sigma^2\), the variance of the error terms, which represents the variability in the dependent variable that is not explained by the predictors.
The least squares method estimates the coefficients by minimizing the sum of squared errors (SSE), which is the sum of the squared differences between the observed values and the predicted values. The formula for SSE is:
\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik})^2 \] where \(y_i\) is the actual or observed value, \(\hat{y}_i\) is the predicted value, \(\hat{\beta}_0, \dots, \hat{\beta}_k\) are the estimated coefficients, and \(x_{i1}, \dots, x_{ik}\) are the predictor values for the \(i\)-th observation.
With the Error Sum of Squares (SSE) calculated, we can estimate the variance of the error terms \(\sigma^2\) using the formula:
\[ \sigma^2 = \frac{\text{SSE}}{n - (k+1)} \] where \(n\) is the number of observations and \(k\) is the number of predictors in the model.
We evaluated the model’s fitness using the coefficient of determination \(R^2\) and the adjusted \(R^2\). The coefficient of determination \(R^2\) measures the proportion of variance in the dependent variable that is explained by the predictors. The adjusted \(R^2\) adjusts the \(R^2\) value based on the number of predictors, which help to provide a more accurate measure of model fit for multiple regression, as the increase the number of predictors can artificially inflate the \(R^2\) value.
To obtain \(R^2\), \(SST\) or the total sum of squares, needed to be calculated first. The \(SST\) measures the total variance in the dependent variable, given by: \[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \] Where \(y_i\) is the observed value, and \(\bar{y}\) is the mean of the observed value. Then, the \(R^2\) can be obtained by: \[ R^2 = 1 - \frac{SSE}{SST} \] After that, \(R^2\) is adjusted as follows based on the number of observations \(n\) and the number of predictors \(k\): \[ R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \]
In this analysis. several hypothesis tests are conducted to determine the significance of the model and its predictors. The overall significance of the model is assessed using the F-ratio. It compares the variance explained by the model to the variance not explained by the model.A higher F-ratio indicates that the model explains a significant amount of variance in the dependent variable compared to the residual variance.
The null hypothesis and alternative hypothesis for the F-ratio are stated as follows:
The null hypothesis \(H_0\) states that all coefficients are equal to zero, meaning that the predictors do not explain the variance in the dependent variable (median house value). Stated as:
\[ H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \]
The alternative hypothesis \(H_a\) states that at least one coefficient is not equal to zero. In our case, this means that at least one predictor explains the variance in the dependent variable (median house value). Stated as: \[ H_a: \text{At least one } \beta_i \neq 0 \]
We also conduct a t-test to determine the significance of each individual predictor in the model:
The null hypothesis \(H_{0i}\) states that the coefficient for the predictor i is equal to zero, meaning that the predictor does not explain the variance in the dependent variable (median house value). Stated as:
\[ H_{0i}: \beta_i = 0 \]
The alternative hypothesis \(H_{ai}\) states that the coefficient for the predictor i is not equal to zero, meaning that the predictor explains the variance in the dependent variable (median house value). Stated as:
\[ H_{ai}: \beta_i \neq 0 \]
In addition to the multiple regression analysis, we conduct stepwise regression analysis to identify the most significant predictors of median house values in Philadelphia. Stepwise regression is a method that automatically selects the best subset of predictors for the model. It involves adding or removing predictors based on their statistical significance (P-value) and the AIC (Akaike Information Criterion). The stepwise regression analysis will help us identify the most important predictors and improve the overall model fits.
However, there are several limitations to stepwise regression. First, the final stepwise regression model is not guaranteed to be optimal in any specific sense. There may be other models that are as good as, or even better than the one selected by stepwise regression, but the procedure only yeilds a single model. Second, the stepwise regression does take into account researchers’ knowledge about the predictors. Important variables may be exclude from the model if they are not included in the initially. Moreover, This method also runs the risk of Type I and Type II errors—meaning it may include unimportant variables or exclude important ones.
To evaluate the performance of the regression model, we conduct K-fold cross-validation. Cross-validation is a technique used to evluate the performance of the predictive models. In K-fold Cross-Validation, the dataset is randomly divided into K equal-sized folds. The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, with each fold serving as the test set exactly once. The average performance across all K folds is then calculated. Cross-validation helps to assess the model’s generalization performance and reduce the risk of overfitting.
In our analysis, we used K=5, which is a common choice for cross-validation. The root mean squared error (RMSE) is used to compare the performance of different models. The RMSE measures the difference between the predicted values and the actual values. A lower value of RMSE indicates a better predictive model.
The mean square errors is calculated as follows: \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] And the root mean square errors (RMSE) is the square root of the mean square errors (MSE): \[ \text{RMSE} = \sqrt{\text{MSE}} \]
In this analysis, we utilize the R programming language, a professional statistical software widely used for data analysis and visualization.The following libraries and packages have been employed to conduct the analysis:
tidyverse: A collection of R packages for data
manipulation and visualization.sf: A package for spatial data processing, especially
for handling shapefiles.ggplot2: A popular visualization package for creating
high-quality map and charts, part of the tidyverse
collection.ggcorrplot: A package designated to visualize
correlation matrices using ggplot2.patchwork: A package for combining multiple ggplot2
plots into a single plot.MASS: A package provides multiple functions and
datasets for statistical analysis, including linear models.caret: A package for creating predictive models and
conducting machine learning tasks, used in cross-validation and k-fold
analysis.kableExtra: A package for creating tables with advanced
formatting options.dependent_var <- "MEDHVAL"
predictors <- c("PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES")
summary_stats <- data %>%
dplyr::select(all_of(c(dependent_var, predictors))) %>%
summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
pivot_wider(names_from = Stat, values_from = Value)
summary_stats$Variable <- recode(summary_stats$Variable,
"MEDHVAL" = "Median House Value",
"NBELPOV100" = "# Households Living in Poverty",
"PCTBACHMOR" = "% of Individuals with Bachelor’s Degrees or Higher",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
)
summary_stats <- summary_stats %>%
mutate(
Mean = round(Mean, 2),
SD = round(SD, 2)
)
summary_stats <- summary_stats %>%
arrange(Variable == "Median House Value")
predictor_rows <- which(summary_stats$Variable != "Median House Value")
dependent_rows <- which(summary_stats$Variable == "Median House Value")
# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred <- max(predictor_rows)
start_dep <- min(dependent_rows)
end_dep <- max(dependent_rows)
# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
add_header_above(c(" " = 1, "Statistics" = 2)) %>%
kable_styling(full_width = FALSE) %>%
group_rows("Predictors", start_pred, end_pred) %>%
group_rows("Dependent Variable", start_dep, end_dep)%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)| Variable | Mean | SD |
|---|---|---|
| Predictors | ||
| % of Individuals with Bachelor’s Degrees or Higher | 16.08 | 17.77 |
| # Households Living in Poverty | 189.77 | 164.32 |
| % of Vacant Houses | 11.29 | 9.63 |
| % of Single House Units | 9.23 | 13.25 |
| Dependent Variable | ||
| Median House Value | 66287.73 | 60006.08 |
data <- data %>%
mutate(
LNMEDHVAL = log(MEDHVAL),
LNPCTBACHMOR = log(1+PCTBACHMOR),
LNNBELPOV100 = log(1+NBELPOV100),
LNPCTVACANT = log(1+PCTVACANT),
LNPCTSINGLES = log(1+PCTSINGLES)
)longer_version<- data %>%
pivot_longer(cols = c("MEDHVAL", "PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES"),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "black", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"MEDHVAL" = "Median House Value",
"PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
"NBELPOV100" = "# Households Living in Poverty",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Dependent and Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))# histograms of the transformed variables
longer_version2 <- data %>%
pivot_longer(cols = c(LNMEDHVAL, LNPCTBACHMOR ,LNNBELPOV100,LNPCTVACANT, LNPCTSINGLES),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version2,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "red", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"LNMEDHVAL" = "Log Median House Value",
"LNPCTBACHMOR" = "Log % with Bachelor’s Degree",
"LNNBELPOV100" = "Log # Households in Poverty",
"LNPCTVACANT" = "Log % Vacant Houses",
"LNPCTSINGLES" = "Log % Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Dependent and log transformed Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))ggplot(shape) +
geom_sf(aes(fill = LNMEDHVAL), color = "transparent") +
scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
name = "LNMEDHVAL",
na.value = "transparent") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)) +
labs(title = "Log Transformed Median House Value")shpe_longer<- shape %>%
pivot_longer(cols = c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV"),
names_to = "Variable",
values_to = "Value")
custom_titles <- c(
PCTVACANT = "Percent of Vacant Houses",
PCTSINGLES = "Percent of Single House Units",
PCTBACHMOR = "Percent of Bachelor's Degree or Higher",
LNNBELPOV = "Logged Transformed Poverty Rate"
)
plot_list <- lapply(unique(shpe_longer$Variable), function(var_name) {
data_subset <- subset(shpe_longer, Variable == var_name)
ggplot(data_subset) +
geom_sf(aes(fill = Value), color = "transparent") +
scale_fill_gradientn(
colors = c("#fff0f3", "#a4133c"),
name = var_name,
na.value = "transparent"
) +
labs(title = custom_titles[[var_name]]) +
theme(
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
legend.key.size = unit(0.3, "cm"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 15, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)
)
})
# Combine the plots into a grid (2 columns by 2 rows)
combined_plot <- (plot_list[[1]] + plot_list[[2]]) /
(plot_list[[3]] + plot_list[[4]])
combined_plotcustom_labels <- c(
"% of Individuals with Bachelor’s Degrees or Higher" = "PCTBACHMOR",
"% of Vacant Houses" = "PCTVACANT",
"% of Single House Units" = "PCTSINGLES",
"# Households Living in Poverty" = "LNNBELPOV100"
)
predictor_vars <- data[, c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV100")]
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
print(cor_matrix)## PCTVACANT PCTSINGLES PCTBACHMOR LNNBELPOV100
## PCTVACANT 1.0000000 -0.1513734 -0.2983580 0.2495470
## PCTSINGLES -0.1513734 1.0000000 0.1975461 -0.2905159
## PCTBACHMOR -0.2983580 0.1975461 1.0000000 -0.3197668
## LNNBELPOV100 0.2495470 -0.2905159 -0.3197668 1.0000000
rownames(cor_matrix) <- names(custom_labels)
colnames(cor_matrix) <- names(custom_labels)
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("#d73027", "white", "#1a9850"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))##
## Call:
## lm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV100, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25825 -0.20391 0.03822 0.21744 2.24347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1137661 0.0465330 238.836 < 2e-16 ***
## PCTVACANT -0.0191569 0.0009779 -19.590 < 2e-16 ***
## PCTSINGLES 0.0029769 0.0007032 4.234 2.42e-05 ***
## PCTBACHMOR 0.0209098 0.0005432 38.494 < 2e-16 ***
## LNNBELPOV100 -0.0789054 0.0084569 -9.330 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3665 on 1715 degrees of freedom
## Multiple R-squared: 0.6623, Adjusted R-squared: 0.6615
## F-statistic: 840.9 on 4 and 1715 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: LNMEDHVAL
## Df Sum Sq Mean Sq F value Pr(>F)
## PCTVACANT 1 180.392 180.392 1343.087 < 2.2e-16 ***
## PCTSINGLES 1 24.543 24.543 182.734 < 2.2e-16 ***
## PCTBACHMOR 1 235.118 235.118 1750.551 < 2.2e-16 ***
## LNNBELPOV100 1 11.692 11.692 87.054 < 2.2e-16 ***
## Residuals 1715 230.344 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitted_values <- fitted(fit)
residuals_values <- residuals(fit)
standardized_residuals <- rstandard(fit)
data <- data %>%
mutate(
Fitted = fitted_values,
Residuals = residuals_values,
Standardized_Residuals = standardized_residuals)ggplot(data, aes(x = Fitted, y = Standardized_Residuals)) +
geom_point(color = "black", size= 0.4) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(
title = "Scatter Plot of Standardized Residuals vs Fitted Values",
x = "Predicted Values",
y = "Standardized Residuals"
) +
theme_minimal() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))ggplot(data, aes(x = Standardized_Residuals)) +
geom_histogram(bins = 30, fill = "black") +
labs(title = "Histogram of Standardized Residuals",
x = "Standardized Residuals",
y = "Frequency") +
theme_minimal() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))longer<-data %>%
pivot_longer(cols = c("PCTBACHMOR", "LNNBELPOV100", "PCTVACANT", "PCTSINGLES"),
names_to = "Variable",
values_to = "Value")
ggplot(longer,aes(x = Value, y = LNMEDHVAL)) +
geom_point(color = "black", size= 0.4) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c(
"PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
"LNNBELPOV100" = "Logged Households Living in Poverty",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
))) +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8)) +
labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
x = "Predictor Value",
y = "Log of Median House Value")join<- data %>%
dplyr::select(POLY_ID, Standardized_Residuals)
shape <- shape %>%
left_join(join, by = c("POLY_ID" = "POLY_ID"))
ggplot(shape)+
geom_sf(aes(fill = Standardized_Residuals), color = "transparent") +
scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
name = "Std Residuals",
na.value = "transparent") + # Choose a color palette, invert direction if needed
labs(title = "Choropleth Map of Standardized Residuals") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))## Start: AIC=-3448.07
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Df Sum of Sq RSS AIC
## <none> 230.34 -3448.1
## - PCTSINGLES 1 2.407 232.75 -3432.2
## - LNNBELPOV100 1 11.692 242.04 -3364.9
## - PCTVACANT 1 51.546 281.89 -3102.7
## - PCTBACHMOR 1 199.020 429.36 -2379.0
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Final Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1715 230.3435 -3448.073
lm <- trainControl(method = "cv", number = 5)
cvlm_model <- train(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100, data=data, method = "lm", trControl = lm)
print(cvlm_model)## Linear Regression
##
## 1720 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3677552 0.6605997 0.2724688
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
cvlm_model_reduced = train(LNMEDHVAL ~ PCTVACANT + MEDHHINC, data = data, method = "lm", trControl = lm)
print(cvlm_model_reduced)## Linear Regression
##
## 1720 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4432016 0.5038089 0.3175971
##
## Tuning parameter 'intercept' was held constant at a value of TRUE